martingale property
Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLMReasoning
Recent advances in reasoning techniques have substantially improved the performance of large language models (LLMs), raising expectations for their ability to provide accurate, truthful, and reliable information. However, emerging evidence suggests that iterative reasoning may foster belief entrenchment, rather than enhancing truth-seeking behavior. In this study, we propose a systematic evaluation framework for belief entrenchment in LLM reasoning by leveraging the Martingale property from Bayesian statistics. This property implies that, under rational belief updating, the expected value of future beliefs should remain equal to the current belief, i.e., belief updates cannot be predicted from solely the current belief. We propose the unsupervised, regression-based Martingale Score to measure violations of this property, signaling a deviation from the Bayesian ability of updating on new evidence. In open-ended problem domains, including event forecasting, value-laden questions, and academic paper review, we found such violations to be widespread across models, reasoning paradigms, problem domains, and system prompts, where the future beliefs are consistently predictable from the model's current belief, a phenomenon which we term belief entrenchment. Through comprehensive experiments, we identify the models (e.g., GPT-4o), reasoning techniques (e.g., chain of thought), and domains (e.g., forecasting) more prone to belief entrenchment. Finally, we validate the Martingale Score by showing that it predicts ground-truth accuracy on problem domains where ground truth labels are available. This indicates that, while designed as an unsupervised metric that operates even in domains without access to ground truth, the Martingale Score is a useful proxy of the truth-seeking ability of the LLM reasoning process.
Volatility Modeling via EWMA-Driven Time-Dependent Hurst Parameters
We introduce a novel rough Bergomi (rBergomi) model featuring a variance-driven exponentially weighted moving average (EWMA) time-dependent Hurst parameter $H_t$, fundamentally distinct from recent machine learning and wavelet-based approaches in the literature. Our framework pioneers a unified rough differential equation (RDE) formulation grounded in rough path theory, where the Hurst parameter dynamically adapts to evolving volatility regimes through a continuous EWMA mechanism tied to instantaneous variance. Unlike discrete model-switching or computationally intensive forecasting methods, our approach provides mathematical tractability while capturing volatility clustering and roughness bursts. We rigorously establish existence and uniqueness of solutions via rough path theory and derive martingale properties. Empirical validation on diverse asset classes including equities, cryptocurrencies, and commodities demonstrates superior performance in capturing dynamics and out-of-sample pricing accuracy. Our results show significant improvements over traditional constant-Hurst models.
LLMs are Bayesian, in Expectation, not in Realization
Chlon, Leon, Rashidi, Sarah, Khamis, Zein, Awada, MarcAntonio M.
Large language models demonstrate remarkable in-context learning capabilities, adapting to new tasks without parameter updates. While this phenomenon has been successfully modeled as implicit Bayesian inference, recent empirical findings reveal a fundamental contradiction: transformers systematically violate the martingale property, a cornerstone requirement of Bayesian updating on exchangeable data. This violation challenges the theoretical foundations underlying uncertainty quantification in critical applications. Our theoretical analysis establishes four key results: (1) positional encodings induce martingale violations of order $Θ(\log n / n)$; (2) transformers achieve information-theoretic optimality with excess risk $O(n^{-1/2})$ in expectation over orderings; (3) the implicit posterior representation converges to the true Bayesian posterior in the space of sufficient statistics; and (4) we derive the optimal chain-of-thought length as $k^* = Θ(\sqrt{n}\log(1/\varepsilon))$ with explicit constants, providing a principled approach to reduce inference costs while maintaining performance. Empirical validation on GPT-3 confirms predictions (1)-(3), with transformers reaching 99\% of theoretical entropy limits within 20 examples. Our framework provides practical methods for extracting calibrated uncertainty estimates from position-aware architectures and optimizing computational efficiency in deployment.
Lost in Retraining: Roaming the Parameter Space of Exponential Families Under Closed-Loop Learning
Jangjoo, Fariba, Marsili, Matteo, Roudi, Yasser
Closed-loop learning is the process of repeatedly estimating a model from data generated from the model itself. It is receiving great attention due to the possibility that large neural network models may, in the future, be primarily trained with data generated by artificial neural networks themselves. We study this process for models that belong to exponential families, deriving equations of motions that govern the dynamics of the parameters. We show that maximum likelihood estimation of the parameters endows sufficient statistics with the martingale property and that as a result the process converges to absorbing states that amplify initial biases present in the data. However, we show that this outcome may be prevented if the data contains at least one data point generated from a ground truth model, by relying on maximum a posteriori estimation or by introducing regularisation.
Uncertainty Quantification for Prior-Data Fitted Networks using Martingale Posteriors
Nagler, Thomas, Rügamer, David
Prior-data fitted networks (PFNs) have emerged as promising foundation models for prediction from tabular data sets, achieving state-of-the-art performance on small to moderate data sizes without tuning. While PFNs are motivated by Bayesian ideas, they do not provide any uncertainty quantification for predictive means, quantiles, or similar quantities. We propose a principled and efficient sampling procedure to construct Bayesian posteriors for such estimates based on Martingale posteriors, and prove its convergence. Several simulated and real-world data examples showcase the uncertainty quantification of our method in inference applications.
Is In-Context Learning in Large Language Models Bayesian? A Martingale Perspective
Falck, Fabian, Wang, Ziyu, Holmes, Chris
In-context learning (ICL) has emerged as a particularly remarkable characteristic of Large Language Models (LLM): given a pretrained LLM and an observed dataset, LLMs can make predictions for new data points from the same distribution without fine-tuning. Numerous works have postulated ICL as approximately Bayesian inference, rendering this a natural hypothesis. In this work, we analyse this hypothesis from a new angle through the martingale property, a fundamental requirement of a Bayesian learning system for exchangeable data. We show that the martingale property is a necessary condition for unambiguous predictions in such scenarios, and enables a principled, decomposed notion of uncertainty vital in trustworthy, safety-critical systems. We derive actionable checks with corresponding theory and test statistics which must hold if the martingale property is satisfied. We also examine if uncertainty in LLMs decreases as expected in Bayesian learning when more data is observed. In three experiments, we provide evidence for violations of the martingale property, and deviations from a Bayesian scaling behaviour of uncertainty, falsifying the hypothesis that ICL is Bayesian.
On the Equivalence of Consistency-Type Models: Consistency Models, Consistent Diffusion Models, and Fokker-Planck Regularization
Lai, Chieh-Hsin, Takida, Yuhta, Uesaka, Toshimitsu, Murata, Naoki, Mitsufuji, Yuki, Ermon, Stefano
It refers to a (diffusion) model that is explicitly designed to align with The emergence of various notions of "consistency" the underlying trajectory defined by an ordinary differential in diffusion models has garnered considerable equation (ODE), stochastic differential equation (SDE), or attention and helped achieve improved sample partial differential equation (PDE). In this study, we aim quality, likelihood estimation, and accelerated to provide a theoretical investigation into the relationships sampling. Although similar concepts have between these three consistency-type models. Under certain been proposed in the literature, the precise relationships mild assumptions, we will rigorously establish the equivalence among them remain unclear. In this of these independently developed concepts.
Probability Paths and the Structure of Predictions over Time
Lin, Zhiyuan, Sheng, Hao, Goel, Sharad
In settings ranging from weather forecasts to political prognostications to financial projections, probability estimates of future binary outcomes often evolve over time. For example, the estimated likelihood of rain on a specific day changes by the hour as new information becomes available. Given a collection of such probability paths, we introduce a Bayesian framework -- which we call the Gaussian latent information martingale, or GLIM -- for modeling the structure of dynamic predictions over time. Suppose, for example, that the likelihood of rain in a week is 50%, and consider two hypothetical scenarios. In the first, one expects the forecast is equally likely to become either 25% or 75% tomorrow; in the second, one expects the forecast to stay constant for the next several days. A time-sensitive decision-maker might select a course of action immediately in the latter scenario, but may postpone their decision in the former, knowing that new information is imminent. We model these trajectories by assuming predictions update according to a latent process of information flow, which is inferred from historical data. In contrast to general methods for time series analysis, this approach preserves the martingale structure of probability paths and better quantifies future uncertainties around probability paths. We show that GLIM outperforms three popular baseline methods, producing better estimated posterior probability path distributions measured by three different metrics. By elucidating the dynamic structure of predictions over time, we hope to help individuals make more informed choices.